Introduce support for Reciprocal Rank Fusion (combining queries)#2489
Introduce support for Reciprocal Rank Fusion (combining queries)#2489alessandrobenedetti wants to merge 14 commits intoapache:mainfrom
Conversation
|
As of today the solution is clean enough to be reviewed and tests are green.
|
| Map<Integer, Integer[]> docIdToRanks; | ||
| docIdToRanks = new HashMap<>(); | ||
| for (int docID : combinedResultsDocIDs) { | ||
| docIdToRanks.put(docID, new Integer[docLists.size()]); |
There was a problem hiding this comment.
To do what? Isn't Integer.valueOf(int a) used to wrap an integer?
Can you elaborate?
dsmiley
left a comment
There was a problem hiding this comment.
I haven't looked at this in detail but I'm wondering could we instead have a single 'q' query parser, that is itself a "query combiner" of its children clauses? Ultimately this would mean not needing all this special code in QueryComponent. I'm trying to make recommendations that "fit in" well with Solr's architecture; less bolted / hacked on.
CC @hossman I think you'd have good inputs on this one too
| "count(//lst[@name='explain']/*)=3", | ||
| "count(//lst[@name='explain']/*)=6", |
There was a problem hiding this comment.
How does this PR affect this assertion of an existing test?
There was a problem hiding this comment.
I added more docs in the setup of the test class, for the new debug tests I added.
I thought that changing the few lines of debug that were counting the total docs was not much of a big deal.
I can obviously change that if you think it is worth.
There was a problem hiding this comment.
I suppose I'd need to see a before & after to truly "see" the change.
solr/core/src/java/org/apache/solr/request/json/RequestUtil.java
Outdated
Show resolved
Hide resolved
In principle I agree, correct me if I'm wrong, your proposed solution means:
I would need to think and design this possible solution, but I have no idea when this will happen as I am afraid it would take a significant amount of time. (of course anyone is happy to help). Do you think it would make sense to start refining this contribution, bake it in so that users have this new possibility decently quickly and then iteratively replace it? It's not because I have a strong opinion on this, I am just realistically think what could bring value from my side, in accordance with my future availabilty. |
|
Want to quickly respond to this:
No! The Lucene project is not the only place where subclasses of Query classes exist ;-) I'm leary of blessing committing something architecturally unclean (in my mind; no offense) that will supposedly be fixed later. I confess I'm not going to veto this as it's not all that bad really; it reads clearly, I understand this part of it so I shouldn't complain too loudly. But personally I'm not going to approve this PR and this wont' block your merge. There are trade-offs admittedly... a special/hacky Query as I proposed (assuming it needs to be hacky?) is its own abuse of sorts; we have another -- PostFilter that isn't really a Query. But at least those are contained to a whole abstraction implementation. Meaning, ignore it if you don't care about it. But QueryComponent is doing a lot; it's already complex... more conditions and special cases here is something we should watch out for. QueryComponent is on the critical path; a place to fight against complexity. Long ago QC was relatively clean and then query grouping came along and well, what a mess that made! Overall I wish we made proposals on the dev list about our ideas before showing up with an implementation. I want my colleagues to do the same so peer review can influence the approach early. I love reviewing other people's ideas before time is invested in an implementation! I had an ad-hoc call with another committer this week to do this sort of thing. I'll see if I can look closer at your PR to think more about a specific recommendation. Sigh; I'm busy though |
dsmiley
left a comment
There was a problem hiding this comment.
I should have spent just another 10 minutes reading this to understand what's going on; I understand it better now. I had overlooked the query loop. I don't think a new Query would be that helpul for this use-case after-all. Even if one were to exist, somewhere there would be code to detect it and loop it so where would that go... either QueryComponent or SolrIndexSearcher which is another complexity minefield of greater magnitude. It's tempting to think of a QueryComponent subclass for this use-case (and for that matter grouping).
BTW I like the query syntax, especially the JSON variant.
Sorry for any distress; +1 overall.
No distress David! I'm not the kind of developer that get attached to its own solutions. I also agree in principle with proposing ideas first, before spending days inplementing them, the only reason that I don't do that often is because I am not full time on Solr contributions right now, actually I end up at the moment with limited time to do them, so I prefer to produce something that could be valuable and iterate. I remember at the beginning I tried query parsing and other approaches but the cleanest I got was the loop in the query component. And I absolutely agree that component is messy, so after Berlin I'll spend a bit more time polishing the changes, and if in the meantime we come up with better ideas, the better! No rush to merge, but at the same time if we get a decent consensus I would love to have this in as this feature will open many doors! Thanks for the time you already spent on this, it's always very much appreciated! |
dsmiley
left a comment
There was a problem hiding this comment.
I could still imagine a query parser being used to express the inputs, and might map to users understanding of things. Users know query parsers, know they can be rich and form a tree, know it matches documents. To the user, maybe this is a special query even though from an implementation perspective, it wouldn't really be implemented as such because the clauses needed to each be evaluated independently. The approach would get more logic out of QueryComponent (the feature parsing aspect). It would add more net code, however, albeit not much. It might better encapsulate the feature, even though the QueryComponent is an integral piece of what implements it. Heck, we could think of this as giving the QueryComponent to a special Query if we want, thus enabling interesting use cases of queries that can be in charge of producing their document list and other stuff.
solr/core/src/java/org/apache/solr/search/combining/QueriesCombiner.java
Outdated
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/search/combining/QueriesCombiner.java
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java
Outdated
Show resolved
Hide resolved
|
|
||
| String COMBINER_ALGORITHM = COMBINER + ".algorithm"; | ||
|
|
||
| String COMBINER_KEYS = COMBINER + ".keys"; |
There was a problem hiding this comment.
WDYT of instead do combine.q multi-valued, so just supply each query as another value? One less indirection and thing to name.
There was a problem hiding this comment.
mmmmm, I do believe it can be useful to have a "query key", especially because the debug component explains both the query parsing and combination:
0.032522473 =
1/(60+1) + 1/(60+2) because its ranks were: 1 for query(lexical), 2 for query(lexical2)
So the keys help in my opinion.
If you were referring to this specific parameter, this is not intended for the user, it gets used automatically by Solr.
Happy to elaborate!
There was a problem hiding this comment.
...
combine.qmulti-valued ...
In the case of multi-valued parameters, is ordering 'guaranteed' or could &combine.q=Foo&combine.q=Bar turn into &combine.q=Bar&combine.q=Foo and if it did would it matter?
There was a problem hiding this comment.
AFAIK it's guaranteed; we use a List not a Set intentionally
Co-authored-by: David Smiley <dsmiley@apache.org>
|
I did the work of migrating to a QParser but don't have permissions to push to your PR branch. Beware, I'm reverting most/all of the files that you modified of existing Solr source files. If you would prefer I submit a separate PR, I'll do that. Instead of Wouldn't the docSet (assuming it's needed, like for faceting) be computed on each subQuery, and if so how is that rolled up to the final QueryResult? I see you tested that faceting works but I suspect there may be a very sad performance bug here. |
I wasn't entirely sure a Query Parser is the right place for this, given we are not really building a "new query" but then I thought : "hey we have the boolean query parser, that more or less does the same", so I have nothing against it! You can go ahead and add your commit, I don't care much if you overwrote many of my changes, I just want a nice and clean solution, however we come up to it! Worst case scenario we go back in commits and re-use what's necessary!
In regards to the docSet It was rolled up in: combinedResultsDocIds is the merged array of doc Ids. |
Revert DebugComponent, ResponseBuilder, SolrPluginUtils
dsmiley
left a comment
There was a problem hiding this comment.
My commit here is itself a draft but significantly reduces the impact to existing Solr source files. Combiner tests pass.
I didn't tend to the debug-ability aspect very much. Just a draft :-)
| if (cmd.getQuery() instanceof SelfExecutingQuery) { | ||
| // TODO QueryResult ought not to be created and passed in methods of QueryComponent | ||
| result = ((SelfExecutingQuery) cmd.getQuery()).search(searcher, cmd); | ||
| } else { |
There was a problem hiding this comment.
Here's where the minimal essential bit to alter QueryComponent by adding a new abstraction (SelfExecutingQuery) that might be useful for other features; not just "combining".
| convertJsonPropertyToLocalParams( | ||
| newMap, jsonQueryConverter, queryJsonProperty, out, isQuery, arr); | ||
| } | ||
| newMap.put(COMBINER_KEYS, queryKeys.toArray(new String[0])); // nocommit a hack |
There was a problem hiding this comment.
Making RequestUtils add a param specific to a niche feature is a hack. If we really need this here (?), I think we should use a generic name like "json.queries".
There was a problem hiding this comment.
+1 for a generic name
json.queries looks all right to me
| public void search(QueryResult qr, QueryCommand cmd) throws IOException { | ||
| try { | ||
| getDocListC(qr, cmd); | ||
| } catch (FuzzyTermsEnum.FuzzyTermsException e) { // unsure where best to catch this; shrug |
There was a problem hiding this comment.
I moved the FuzzyTermsException catch block from QueryComponent to here. I wanted it in a deeper place that would be caught by multiple inward code paths instead of just QC. It still feels it ought to be deeper but I didn't look for a better spot. CombineQuery calls into this.
There was a problem hiding this comment.
It seems all right to me, but I don't have a strong opinion
| import org.apache.solr.search.SolrIndexSearcher; | ||
| import org.apache.solr.search.SyntaxError; | ||
|
|
||
| public class CombineQParserPlugin extends QParserPlugin { |
There was a problem hiding this comment.
This source file captures the essential aspect of my proposal. Needs another iteration of refinement at least (e.g. javadocs, and param naming / defaulting). Maybe more.
| } | ||
| } | ||
|
|
||
| static final class CombineQuery extends ExtendedQueryBase |
There was a problem hiding this comment.
A "SelfExecutingQuery" isn't normal; Lucene isn't going to work with this thing since it doesn't implement getWeight nor can it. But users don't need to know/care.
| results[i] = qr; | ||
| } | ||
|
|
||
| // nocommit but how is the docSet (e.g. for faceting) or maybe other things supported? |
There was a problem hiding this comment.
This seems like a fundamental limitation in the whole feature honestly. What Hossman said about re-rank query is interesting... not sure if it addresses this.
There was a problem hiding this comment.
mmm probably I have lost in the comments why the way I created the docSet was incorrect. Can you elaborate? (a nice occasion to learn how that docSet part works, as I didn't have time yet)
| "count(//lst[@name='explain']/*)=3", | ||
| "count(//lst[@name='explain']/*)=6", |
There was a problem hiding this comment.
I suppose I'd need to see a before & after to truly "see" the change.
alessandrobenedetti
left a comment
There was a problem hiding this comment.
I like your changes David, they are less intrusive than my original version!
let me know once you do an additional refinement and I'll review them again
| req.getContext().put(SolrIndexSearcher.STATS_SOURCE, statsCache.get(req)); | ||
|
|
||
| // TODO QueryResult ought not to be created and passed in methods of QueryComponent. | ||
| // Should be created ~exclusively in SolrIndexSearcher and returned by it. |
There was a problem hiding this comment.
I tend to not be a huge fan of comments in the code, what does this mean? it's just for your convenience in the draft?
If that's the case, ignore my comment
| convertJsonPropertyToLocalParams( | ||
| newMap, jsonQueryConverter, queryJsonProperty, out, isQuery, arr); | ||
| } | ||
| newMap.put(COMBINER_KEYS, queryKeys.toArray(new String[0])); // nocommit a hack |
There was a problem hiding this comment.
+1 for a generic name
json.queries looks all right to me
| public void search(QueryResult qr, QueryCommand cmd) throws IOException { | ||
| try { | ||
| getDocListC(qr, cmd); | ||
| } catch (FuzzyTermsEnum.FuzzyTermsException e) { // unsure where best to catch this; shrug |
There was a problem hiding this comment.
It seems all right to me, but I don't have a strong opinion
| @Override | ||
| public Query parse() throws SyntaxError { | ||
|
|
||
| var queriesToCombineKeys = localParams.getParams("keys"); |
There was a problem hiding this comment.
I am not a huge fan of var as in my opinion it reduces readability, but if it's ok as stadnard in Solr, no strong opinion
| } | ||
|
|
||
| @Override | ||
| public void addDebugInfo(NamedList<Object> dbg) { |
There was a problem hiding this comment.
maybe 'addQueryDebugInfo' ?
| results[i] = qr; | ||
| } | ||
|
|
||
| // nocommit but how is the docSet (e.g. for faceting) or maybe other things supported? |
There was a problem hiding this comment.
mmm probably I have lost in the comments why the way I created the docSet was incorrect. Can you elaborate? (a nice occasion to learn how that docSet part works, as I didn't have time yet)
| "inStock_b1", | ||
| "false")); | ||
| assertU( | ||
| adoc("id", "6", "title", "Gohan is the first saiyan-human hybrid.", "inStock_b1", "false")); |
There was a problem hiding this comment.
as you can see @dsmiley here I added more docs, that's the reason they are now 6
|
|
||
| String COMBINER_KEYS = COMBINER + ".keys"; | ||
|
|
||
| String RECIPROCAl_RANK_FUSION = "rrf"; |
There was a problem hiding this comment.
| String RECIPROCAl_RANK_FUSION = "rrf"; | |
| String RECIPROCAL_RANK_FUSION = "rrf"; |
| var blendParams = SolrParams.wrapDefaults(localParams, params); | ||
| queriesCombiningStrategy = QueriesCombiner.getImplementation(blendParams); | ||
|
|
||
| String defType = blendParams.get(QueryParsing.DEFTYPE, DEFAULT_QTYPE); |
There was a problem hiding this comment.
Maybe could support different defType for different keys, though bit of a niche thing that perhaps.
| } | ||
| combinedRankedList.setSegmentTerminatedEarly(segmentTerminatedEarly); | ||
|
|
||
| combinedRankedList.setNextCursorMark(rankedLists[0].getNextCursorMark()); |
There was a problem hiding this comment.
Not sure about picking the first list here and/or what does use of a cursor mark mean when combining queries. Alternative might be to just disallow cursor marks when combining queries.
|
@cpoerschke see the parent JIRA issue for the real state of this controvertial PR |
Thanks for the pointer! It inspired me to do some #2597 scribbling. |
|
This PR has had no activity for 60 days and is now labeled as stale. Any new activity or converting it to draft will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the dev@solr.apache.org mailing list. Thank you for your contribution! |
|
This PR has had no activity for 60 days and is now labeled as stale. Any new activity will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the dev@solr.apache.org mailing list. To exempt this PR from being marked as stale, make it a draft PR or add the label "exempt-stale". If left unattended, this PR will be closed after another 60 days of inactivity. Thank you for your contribution! |
|
This PR is now closed due to 60 days of inactivity after being marked as stale. Re-opening this PR is still possible, in which case it will be marked as active again. |
https://issues.apache.org/jira/browse/SOLR-17319
Description
Reciprocal Rank Fusion (RRF) is an algorithm that takes in input multiple ranked lists to produce a unified result set.
Examples of use cases where RRF can be used include hybrid search and multiple Knn vector queries executed concurrently.
RRF is based on the concept of reciprocal rank, which is the inverse of the rank of a document in a ranked list of search results.
The combination of search results happens taking into account the position of
the items in the original rankings, and giving higher score to items that are ranked higher in multiple lists. RRF was introduced the first time by Cormack et al. in [1].
The syntax proposed:
JSON Request
[1] Cormack, Gordon V. et al. “Reciprocal rank fusion outperforms condorcet and individual rank learning methods.” Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (2009)
Solution
The support has been introduced leveraging the "queries" support in the JSON request syntax.
The combination of results happen at QueryComponent, processing level.
The debug component has the responsibility of managing the explainability part.
Limitations:
Tests
Tests have been added to cover the main use cases
Checklist
Please review the following and check all that apply:
mainbranch../gradlew check.